-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(gthread): reduce unnecessary polling #3319
Conversation
gthread calls epoll_wait (and 2 other syscalls) every second because it specifies timeout to be 1 second. ``` λ sudo strace -p `pgrep -f "gunicorn: worker" | head -n1` strace: Process 30815 attached epoll_wait(7, [], 1, 666) = 0 getppid() = 30800 utimensat(6, NULL, [{tv_sec=3157, tv_nsec=198136276} /* 1970-01-01T06:22:37.198136276+0530 */, {tv_sec=3157, tv_nsec=198136276} /* 1970-01-01T06:22:37.198136276+0530 */], 0) = 0 epoll_wait(7, [], 1, 1000) = 0 getppid() = 30800 utimensat(6, NULL, [{tv_sec=3158, tv_nsec=204192934} /* 1970-01-01T06:22:38.204192934+0530 */, {tv_sec=3158, tv_nsec=204192934} /* 1970-01-01T06:22:38.204192934+0530 */], 0) = 0 epoll_wait(7, [], 1, 1000) = 0 getppid() = 30800 utimensat(6, NULL, [{tv_sec=3159, tv_nsec=210145196} /* 1970-01-01T06:22:39.210145196+0530 */, {tv_sec=3159, tv_nsec=210145196} /* 1970-01-01T06:22:39.210145196+0530 */], 0) = 0 epoll_wait(7, [], 1, 1000) = 0 getppid() = 30800 utimensat(6, NULL, [{tv_sec=3160, tv_nsec=215517372} /* 1970-01-01T06:22:40.215517372+0530 */, {tv_sec=3160, tv_nsec=215517372} /* 1970-01-01T06:22:40.215517372+0530 */], 0) = 0 epoll_wait(7, ^Cstrace: Process 30815 detached <detached ...> ``` Timing out every second wakes up the process and loads it on CPU even if there is nothing to service. This can be detrimental when you have total workers >> total cores and multi-tenant setup where certain tenants might be sitting idle. (but not "idle enough" because of 1s polling timeout) This can possibly keep a completed future in queue for a small while, but I don't see any obious problem with it except few bytes of extra memory usage. I could be wrong here. fixes benoitc#3317
There is a mismatch between the title and content. I suspect you meant to apply some fraction of the timeout. I wonder if TTFB for a burst of request after long idle is meaningfully impacted by cleaning up keep-alive sockets all at once. Either also consider the potentially lower poll time for that, or document the changed |
@pajod timeout is divided by 2 when the worker is created. So self.timeout is half of request timeout. Lines 590 to 592 in 787c914
I'll look into 2nd part of your comment in some time. |
@pajod I think best course of action is to keep timeout as
|
When worker goes idle, two problems can occur: 1. keepalived connections might stay open for request_timeout / 2. That can be significantly longer than keep-alive timeout. 2. When next request comes up after idle period, all keepalived sockets must be cleaned up first before we can start responding to requests. This will increase TTFB. After this change: 1. keepalived sockets are cleaned up with frequency of *at least* keepalive timeout. 2. TTFB is minimally affected. 3. Worst case: worker goes idle with keepalived connection that was almost hitting timeout . That connection will be kept alive for ~2x keepalive timeout.
@benoitc kind reminder to review this, small change. :) |
Could you use timer file descriptors instead? Then the keep alive timeout could be enforced accurately and polling would be unnecessary. |
gthread calls
epoll_wait
(and 2 other syscalls) every second because itspecifies timeout to be 1 second.
Timing out every second wakes up the process and loads it on CPU even
if there is nothing to service.
This can be detrimental when you have
total workers >> total cores
anda multi-tenant setup where most tenants might be sitting idle. (but not
"idle enough" because of 1s polling timeout)
This can possibly keep a completed future in the queue until the next
request arrives, but I don't see any obvious problem with it except a few
bytes of extra memory usage? I could be wrong here, please check this.
fixes #3317 (more details on issue)